# Comun, The Programming Language Specification

**NEXT VERSION DRAFT, WORK IN PROGRESS**

*version TODO, by Miloslav Číž (drummyfish), 2023, released under CC0 1.0, public domain (https://creativecommons.org/publicdomain/zero/1.0/)*

This is a specification of the comun programming language. Currently it is rather informal, a more formal one may be made once the language matures. Many things are yet left unspecified so as to keep this short, simple and allow evolution -- for example error behavior and many commonly used terms are handled without precise definitions, common meanings are assumed. Where something is unspecified, suppose freedom of implementation, try to implement a simple and logical solution that respects the spirit of the language.

**Language naming and compliance**: this is a recommendation for naming of implementations and modifications of the language. The language specified by this document is named *comun*; any implementation that calls itself a *comun* implementation should adhere to all mandatory requirements specified here. It is planned that future versions of this language will continue to be called *comun*, they will be distinguished just by the version number. A simplified version of comun that doesn't have to support preprocessing, file includes, user defined pointers, more than two command blocks in the branch statement and may only implement type environment 0 may be called *minicomun*. A name for versions further simplified to not even support user function definitions is *microcomun* -- such versions may still support calling of external functions. It is recommended different types of modifications choose yet a different name, which may or may not contain the word *comun* as its part.

**Version numbering of this document** has the traditional *MAJOR.MINOR* format. The first two digits of *MINOR* number increase with changes in the specification that make the language different, its further digits increase with minor changes not affecting the language itself (such as typo corrections, formatting etc). *MAJOR* number is incremented and *MINOR* number is set to zero after many significant changes since the start of the current *MAJOR* version. Note that the specification version number and version number of its various implementations are of course generally not the same.

## Basics

The language is minimalist, statically typed, imperative, Turing complete, general purpose and low level, it mainly operates on stacks but also supports pointers for random memory access and implementing multiple stacks. It may be both interpreted and compiled. Postfix notation is used. English keywords are avoided; simple, short math-like symbols are preferred. The language aims for minimum of abstraction above and general similarity to traditional hardware architectures (i.e. easy mapping to assembly languages), it should enable good performance, being future proof, universal and as much as possible free (i.e. public domain, not requiring complex hardware etc.). The following do NOT belong among the language goals: high abstraction (non-imperative paradigms, ...), handholding, rapid programming ("productivity"), parallelism, mainstream popularity, making profit or being beginner friendly.

**Character set**: the language source code may only contain 7bit ASCII characters except for the character with value zero (this character is reserved for potentially terminating the source code string). **Blank character** is any character with ASCII value smaller or equal to that of space (` `); except for the first preprocessing stage, the characters `[` and `]` are also considered blank (no matter where they appear). Platform specific newline sequences in the source code are to be considered a single **newline** character `\n` (ASCII value 10).

**Comments**: during processing of source code if the hash character (`#`) is read and this character is not part of a string literal, the processing considers all characters read blank characters until the end of the next nearest hash character (`#`) or newline (`\n`).

**Tokens**: source code of the language can be seen as a sequence of tokens. Token is a string appearing outside of comment, consisting of only non-blank characters, except for string literals which may contain blank characters but are considered a single token.

**Identifiers**: identifier is a string composed of at least one character and each character of an identifier has to be either a letter (`a` - `z`, `A` - `Z`), decimal digit (`0` - `9`) or underscore (`_`), however the first character cannot be a decimal digit.

**Numeric literals**: numeric literal may start with sign (`+` or `-`). If sign is present, base specifier may follow. Then one to many digits of given base follow. Possible bases are:

- **decimal**: Base specifier is `d`. This is the default base considered if no base specifier is present. Base digits are `0` to `9`. Examples: `0`, `-0`, `+d0123`, `732`.
- **hexadecimal**: Has to start with a sign followed by base specifier `x`. Base digits are `0` to `9` and `a` to `f`. Examples: `+x00ff`, `-x100`.
- **binary**: Has to start with a sign followed by base specifier `b`. Base digits are `0` and `1`. Examples: `+b0`, `-b1010110`.

Let a non-negative decimal literal without base specifiers and without any unnecessary leading zeros be called a **plain decimal numeric literal**.

**String literals**: a string literal is a string that starts and ends with the double quote character (`"`). Between these there is zero to many characters that are not double quote. There are no escape sequences for strings (values not representable in string literals can be pushed to stack as numeric literals).

**Data types**: only atomic integer types that are implicitly considered unsigned exist in the language. A data type is defined only by its bit width *N* which is a non-negative integer. 0 is a special value used to signify the platform's **native data type**, the data type identified by 0 has bit width equivalent to the platform's native integer type if possible, but at least 16 bits. Data type with bit width *N* can hold exactly the values 0 to 2^*N* - 1 (including both bounds). Some built-in operations may interpret memory values as signed; in that case two's complement representation is considered, i.e. results of operations are those that would be obtained in two's complement representation. For unsigned operations direct representation is considered.

Each data type has its own separate **type environment** (TE). A TE is defined in the same way as data type: by its integer non-negative bit-width *N*. This also holds for the data type identified by 0 (the platform's native data type). Note that TE 0 is always separate, i.e. if for example given platform's native type has bit width 16, TE 0 and 16 are still separate. Each TE has its own pointer table and memory consisting of cells, each one holding a value of that environment's type. The cells have integer addresses, first one has address 0, the second one 1 and so on up until a limit imposed by the language processor. At any given point during source code processing exactly one TE is active and the commands of the language operate on this environment's memory. I.e. the currently active TE implies the **currently set data type**. The default TE set at the start of language processing is the platform's native data type (0).

The language doesn't have to support all possible TEs, only TE 0 is required, however it is recommended to also implement at least TE 8, 16 and 32.

**Program arguments**: programs may receive external string arguments for example from the operating system. This passing of arguments can be seen just as if a string *S* was prepended to the source code where *S* is of form `0 Sn ... 0 S3 0 S2 0 S1 N ` where *S1* ... *Sn* are string literals of the arguments and *N* is the number of arguments (if no arguments are passed only a single zero value will be present on the stack at the start of the program). If no parameters are passed or if parameter passing is not supported, a single value 0 still has to be pushed on the stack before program execution starts. If preprocessing is implemented then argument passing to the preprocessor itself may be allowed in the same way.

There must be at least 16 free (usable) cells in each supported TE, starting from the initial main stack top address (however this should be much more if possible), before pushing program arguments.

There is no **standard library**.

## Pointers

A pointer is a name-value pair stored in a pointer table. The pointer name is an identifier (with an exception described below) unique within the pointer table of given TE and its value is a non-negative integer address referring to a specific cell in the TE whose pointer table the pointer resides in. As each TE has a separate pointer table, there can exist pointers with the same name in different TE.

In each TE special pointers named `0`, `1`, ... `9` always exist by default (these pointer names are an exception to pointer names having to be identifiers). Pointers `1`, ... `9` are read-only and can NOT be assigned addresses to directly with commands, their address is always derived from pointer `0`. Pointer `0` always contains the current address at which commands operate, pointer `1` contains address stored in pointer `0` minus 1 etc. As most commands exhibit a stack behavior, we will also refer to the pointer `0` as the **stack top** pointer. The initial addresses stored in pointers, including the pointer `0`, are decided by the language processor.

## Commands

Here we describe the basic unstructured commands of the language, control structures are described in another section.

Built-in operations of the language usually work on the stack of the currently set TE (see pointers and stack top).

Accessing (reading from or writing to) **memory outside bounds** (below 0 or above the maximum address) causes a run time error. Whether leaving memory bounds (without accessing it, e.g. merely moving stack top below address zero) causes an error is left to be decided by each implementation.

Implementation of operations as described below must ensure they do NOT affect values above the final stack top address (because the programmer may have temporarily moved downwards the stack with the intention to return back up later). For example it might be tempting to implement the increment operation (`++`) as a shorthand for pushing the value one and invoking the add command (`1 +`), however this cannot be done for the above mentioned reason.

A **pop** means decrementing stack top pointer. A pop itself mustn't modify any values in memory cells. A **push** of a value means incrementing stack top pointer and then writing the value to the stack top memory cell.

**Type conversions**: the result of an operation that's to be of data type of width *N* is the same as if first all the input operands of were converted to a binary number of infinite bit width (with sign extension if we are considering two's complement), then the operation was performed and lowest *N* bits of the result were taken.

In the following description of commands let *x*, *y* and *z* be the value stored in the cell at the address of pointer `0`, pointer `1` and pointer `2`, respectively, before performing the command, interpreted as an unsigned integer in direct representation unless mentioned otherwise. Operations explicitly mentioned as signed interpret operands as signed integers in two's complement representation.

**Pointers and addresses**: if a command works with an address of a pointer, the address just before invocation of the command is considered (this is significant e.g. with commands that modify stack top address and pop at the same time). If a command that directly modifies a pointer address is applied to one of the special pointers `1`, ... `9`, nothing happens, as these pointers are considered just virtual read-only pointers relative to pointer `0`.

The language commands are

- **numeric literal**: Pushes the value represented by the literal.
- **string literal**: Pushes ASCII values corresponding to the characters in the string literal from back to front, minus the double quote characters. E.g. `"abc"` pushes 99, then 98, then 97.
- identifier (**function call**): Pushes return address on call stack and jumps to executing the corresponding function. For details see the section on language structure.
- `+` (**addition**): Pops 2 values, pushes the value *y + x*.
- `++` (**increment**): Pops 1 value, pushes the value *x + 1*.
- `-` (**subtraction**): Pops 2 values, pushes the value *y - x*.
- `--` (**decrement**): Pops 1 value, pushes the value *x - 1*.
- `*` (**multiplication**): Pops 2 values, pushes the value *y * x*.
- `/` (**division**): Pops 2 values, pushes the value *y / x* (integer division). If *x* is 0, error occurs.
- `//` (**signed division**): Signed operation, pops 2 values, pushes the value *y // x* where *//* signifies signed integer division. If *x* is 0, error occurs.
- `%` (**remainder**): Pops 2 values, pushes the values *y mod x* (remainder after integer division). If *x* is 0, error occurs.
- `%%` (**signed remainder**): Signed operation, pops 2 values, pushes the value *y - (y // x) * x* where *//* is signed integer division. If *x* is 0, error occurs.
- `><` (**swap**): Pops 2 values, pushes the values *x*, then *y*.
- `^` (**pop**): Pops 1 value.
- `=` (**equality**): Pops 2 values. If *x* is equal to *y*, pushes the value 1, otherwise pushes the value 0.
- `!=` (**inequality**): Pops 2 values. If *x* is equal to *y*, pushes the value 0, otherwise pushes the value 1.
- `<` (**smaller inequality**): Pops 2 values. If *y < x*, pushes the value 1, otherwise pushes the value 0.
- `<<` (**smaller signed inequality**): Signed operation, pops 2 values. If *y < x*, pushes the value 1, otherwise pushes the value 0.
- `<=` (**smaller or same inequality**): Same as *smaller inequality*, but the comparison performed is *y <= x*.
- `<<=` (**smaller or same signed inequality**): Same as *smaller signed inequality*, but the comparison performed is *y <= x*.
- `>` (**greater inequality**): Same as *smaller inequality*, but the comparison performed is *y > x*.
- `>>` (**greater signed inequality**): Same as *smaller signed inequality*, but the comparison performed is *y > x*.
- `>=` (**greater or same inequality**): Same as *smaller inequality*, but the comparison performed is *y >= x*.
- `>>=` (**greater or same signed inequality**): Same as *smaller signed inequality*, but the comparison performed is *y >= x*.
- `|` (**bitwise or**): Pops 2 values, pushes the value *y | x* where *|* is a bitwise or operation.
- `&` (**bitwise and**): Pops 2 values, pushes the value *y & x* where *&* is a bitwise and operation.
- `!` (**bitwise not**): Pops 1 value, pushes the value *!x* where *!* is a bitwise negation operation.
- `!!` (**logical not**): Pops 1 value. If x is non-zero, pushed the value 0, otherwise pushes the value 1.
- `|!` (**bitwise xor**): Pops 2 values, pushes the value *y |! x* where *|!* is a bitwise xor operation.
- `|>` (**shift right**): Pops 2 values, pushes the value *y* bit-shifted by *x* to the right (towards least significant bit).
- `|<` (**shift left**): Same as *shift right* but the shift performed is to the left.
- `<-` (**input**): Reads a single value from the input and pushes it. Typically this can mean reading a single text character from input and pushing its ASCII value, but this interpretation is not required. The operation is blocking and doesn't check for end of input. If end of input has been reached and this command is performed, 0 is pushed.
- `->` (**output**): Pops 1 value and outputs it. Typically this can mean printing out a character with ASCII value *x*, but this interpretation is not required.
- `-->`(**string output**): Equivalent to the sequence of commands `@' -> . ^`.
- `<?` (**input unfinished**): Pushes 0 if the previous `<-` command attempted to read from a finished input stream, otherwise 1.
- `??` (**conditional**): Pops 3 values. If *z* is not zero, pushes *y*, otherwise pushes *x*.
- `>N` where *N* is a plain decimal numeric literal (**type environment transfer**): Pops 1 value, then writes this value to the stack top in TE *N* (without shifting the stack top pointer in environment *N*). Currently set data type isn't changed by this command.
- **pointer commands**: Let *N* and *M* be pointer names already declared before the command. Then the following are possible pointer commands:
  - `$N` (**pointer dereference**): Pushes the value stored at the address pointed to by pointer named *N*.
  - `$:N` (**pointer assignment**): Pops 1 value and stores it to the address pointed to by pointer named *N*.
  - `$N>M` (**pointer copy**): Copies the address stored in pointer named *N* to pointer named *M*.
  - `$N=M` (**pointer comparison**): Pushes value 0 if addresses in pointer *N* and *M* are the same, otherwise pushes 1 if address in pointer *N* is greater than address in pointer *M*, otherwise pushes 2.
  - `$>N` (**pointer increment**): Increments the address stored in pointer named *N* by 1.
  - `$<N` (**pointer decrement**): Decrements the address stored in pointer named *N* by 1.
  - `$+N` (**pointer add**): Pops 1 value, interprets it as a signed value and adds it to the address stored in pointer named *N*. If *N* is 0, then the final address in pointer `0` will be that which there was just before invoking this command plus *N* (i.e. pop is ignored here).
  - `$` (**push**): Pops 1 value, pushes the value *x* places below stack top.
  - `$$` (**address**): Pushes the address stored in pointer 0 (stack top).
- **directives**: Directives are compile-time commands that modify the language processor's state or behavior when it is analyzing the code, they don't represent a run time action (for example it is NOT possible to dynamically allocate a pointer with the pointer definition directive). Possible directives are:
  - `~N` where *N* is a plain decimal numeric literal corresponding to one of the supported TEs (**type environment switch**): Makes the language processor immediately change its current TE to *type N*.
  - `~I` or `~I:N` where *I* is identifier and *N* is a plain decimal numeric literal (**pointer definition**): Creates a new pointer within the current TE with the name *I* and size equal to the number represented by *N*. Pointer name must be unique among all pointer names within given TE. The first variant of the command (without explicitly specified *N*) supposes *N* equal to 1. Size *N* here means setting the pointer value so that it points to an address which is followed by at least *N* cells of which none is pointed to by any other pointer. Note that *N* = 0 is a valid value.
  - `~:I` where *I* is identifier (**label**): Creates a new label with the name *I*. The label name must be unique among all label names in the source code. If the goto command isn't supported, this directive doesn't have to be supported.
  - `~S` where *S* is string literal (**file include**): Allows inclusion of source code from another file *F* identified by the string literal *S* which is a file name in given platform's format. If this command is encountered during compilation, it is replaced by the content of file *F*. Inclusion of a file that has already been opened is to be ignored (while issuing possible warning), i.e. any source code file can be processed at most once.
 
Any above mentioned command that performs at least one pop, except for the `-->` command (whose number of pops may be unknown at compile time), has also a **non-popping variant** whose name is formed by appending `'`  to the name of the command. E.g. the command `+'` performs the same operation as command `+` but doesn't perform any pops.

## Structure

**Branch** starts with `?` symbol, then 1 to many blocks of commands separated by `;` and terminated by `.` follow:

```
? COMMANDS_ELSE ; ... ; COMMANDS_2 ; COMMANDS_1 ; COMMANDS_0 .
```

When at run time branch is encountered, a value *V* is popped in the current TE, then either none or one of the command blocks get executed depending on *V*. If there is just one block of commands, then the following happens: if *V* is non-zero, the commands are executed, otherwise they are not executed. If there are more than one command blocks, then the block `COMMANDS_N` get executed if *V* equals *N* and if none such block exists, then `COMMANDS_ELSE` get executed.

NOTE: this form of branching covers both the traditional *if* and *switch* control flow statements, i.e. *switch* statement is just a generalization of *if* statement.

`?'` can be used instead of `?` in which case no pop is performed by the branch condition check.

**Loop** is of form

```
@ COMMANDS .
```

When at run time loop is encountered, a value is popped in the current TE and it is checked whether the popped value was non-zero -- if so, `COMMANDS` are processed and the control jumps right before the loop start.

`!@` is the **loop break** command. When this command is encountered at run time, program control jumps right after the end of the current innermost loop. If this command is encountered outside any loop, an error occurs (it is recommended to be a compile time error but interpreters may also resort to a run time error only upon attempted execution of this command).

`@'` can be used instead of `@` in which case stack top is not popped when checking the loop condition. `@@` can be used instead of `@` to create an infinite loop, i.e. a loop that doesn't perform any condition checks (and no pops) and repeats until exited with the break command.

**Function definition** has the following format

```
functionName: COMMANDS .
```

Here `functionName` is the function name, an identifier unique among all other defined functions, and `COMMANDS` is the function body that is performed when the function is called. Function definition must appear on the outermost level of global scope, i.e. it cannot appear inside another function definition, inside a branch, inside a loop etc.

**Function call** is performed simply with a command equating the function name (argument passing and return values have to be performed via memory cells). Function calls work with a separate return address stack (different from the general stack), i.e. recursion is possible. It is possible to call a function that will be defined after the function call appears (forward declarations aren't needed).

`!.` is the **exit** command. If this command is encountered inside a function, the function is immediately left (just as if its end was reached). If this command is encountered outside a function, it immediately halts the program.

**External functions** can be called; once processing of source code is finished, any call to a function not defined in the source code is considered a call of an external function. **Exposing functions** from the language is also possible; the language processor can make all defined functions callable from external environments. In both cases the conventions for parameter passing and returning values are left to be decided by every implementation. Calling external functions and exposing functions is to serve the compatibility with other languages and computing environments, libraries made within the language itself are to simply use the file include command.

`>S` where *S* is identifier is the **goto** command. Support of this command is optional. When this command is encountered at run time, the program control jumps to the location of label with name *S*. Jumping to a label which will be defined after the goto command is possible. Jumping into functions from outside is possible with behavior matching that of the traditional call stack implementation of function calls; when end of a function that's been jumped into is reached, control jumps to the address on top of call stack (and call stack is popped), unless call stack is empty in which case a run time error occurs.

## Preprocessor

The language may optionally include a preprocessor that allows for automatically modifying the underlying source code before compilation. I.e. with preprocessor activated the source code is first preprocessed to form a new source code which is then compiled. Preprocessor uses the same language as has been defined so far.

Blocks of preprocessing code start with the `[` character and end with the `]` character. Note that these characters are considered preprocessing delimiters everywhere, i.e. even if they appear inside a comment or string literal etc. These characters don't have to be separated by blank characters as language tokens do. Preprocessing blocks cannot be nested.

If preprocessing is active, any source code file's content is to be prepended with the `]` character and appended with the `[` character (this is important for correct behavior during file includes).

Suppose we have source code *S* (with `]` and `[` already prepended/appended and with all include commands already replaced with the corresponding file contents which have also been prepended/appended with the same characters) which is to be preprocessed to form the final preprocessed source code *S2*. *S2* can be seen as obtained from *S* like this:

1. Replace all strings that start with `]` and end with next nearest `[` with language code that prints the string minus the bounding `]` and `[` characters. This code may only modify memory cells above the current stack top in each TE and after execution has to leave addresses in all pointers unchanged. Let the code at this stage be called *S1*. *S1* is the results of the **first preprocessing stage**.
2. Execute *S1*; the output it gives is *S2*. *S2* is the result of the **second preprocessing stage**. This is the final preprocessed code.

Note that this is only a description of semantics, the language preprocessor doesn't actually have to create the intermediate code *S1* if it can generate *S2* in another way.